K-vec: A New Approach for Aligning Parallel Texts
Abstract
Various methods have been proposed for aligning texts in two or more languages, such as the Canadian Parliamentary Debates (Hansards). Some of these methods generate a bilingual lexicon as a by-product. We present an alternative alignment strategy, which we call K-vec, that starts by estimating the lexicon. For example, it discovers that the English word fisheries is similar to the French pêches by noting that the distribution of fisheries in the English text is similar to the distribution of pêches in the French. K-vec does not depend on sentence boundaries.
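The distributional comparison described in the abstract can be illustrated with a minimal sketch. It assumes that each text is cut into K roughly equal pieces, that a word is represented by a K-dimensional 0/1 vector marking the pieces in which it occurs, and that two words are scored by the mutual information of their co-occurrence in the same piece. The function names and the toy token lists are hypothetical and are not taken from the abstract.

```python
import math


def occurrence_vector(tokens, word, k):
    """Return a length-k 0/1 vector: 1 where `word` occurs in that piece of the text."""
    n = len(tokens)
    vec = [0] * k
    for i, tok in enumerate(tokens):
        if tok == word:
            # Map token position i to one of k roughly equal pieces.
            vec[i * k // n] = 1
    return vec


def mutual_information(vec_a, vec_b):
    """Mutual information (in bits) of two words occurring in the same piece."""
    k = len(vec_a)
    both = sum(1 for a, b in zip(vec_a, vec_b) if a and b)
    count_a = sum(vec_a)
    count_b = sum(vec_b)
    if both == 0:
        return float("-inf")  # the words never share a piece
    return math.log2((both / k) / ((count_a / k) * (count_b / k)))


# Toy data (assumption): two "parallel" token streams of equal length.
english = ["fisheries", "a", "b", "c", "fisheries", "d", "e", "f"]
french = ["pêches", "de", "x", "de", "de", "pêches", "y", "de"]

K = 4
v_fisheries = occurrence_vector(english, "fisheries", K)
v_peches = occurrence_vector(french, "pêches", K)
v_de = occurrence_vector(french, "de", K)

score_peches = mutual_information(v_fisheries, v_peches)
score_de = mutual_information(v_fisheries, v_de)
```

In this toy example fisheries and pêches occupy the same pieces and score higher than fisheries and de, which occurs in every piece and so carries no information. Because only piece membership is compared, no sentence boundaries are needed.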
Similar resources
Aligning Noisy Parallel Corpora Across Language Groups: Word Pair Feature Matching by Dynamic Time Warping
We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.
Char_align: A Program for Aligning Parallel Texts at the Character Level
There have been a number of recent papers on aligning parallel texts at the sentence level, e.g., Brown et al. (1991), Gale and Church (to appear), Isabelle (1992), Kay and Rösenschein (to appear), Simard et al. (1992), Warwick-Armstrong and Russell (1990). On clean inputs, such as the Canadian Hansards, these methods have been very successful (at least 96% correct by sentence). Unfortunate...
Longest Sorted Sequence Algorithm for Parallel Text Alignment
This paper describes a statistically supported, language-independent method for aligning parallel texts (texts that are translations of each other, or of a common source text). This new approach is inspired by previous work by Ribeiro et al. (2000). The application of the second statistical filter, proposed by Ribeiro et al., based on Confidence Bands (CB), is substituted by the applicatio...
Extracting Recurrent Phrases and Terms from Texts Using a Purely Statistical Method
Most statistical measures for extracting interesting word pairs such as MI and t-score require a large corpus to work well. This paper evaluates some of the most widely used statistical measures and introduces a method that can identify significant bigrams in relatively small texts by adapting Fung and Church's (1994) K-vec algorithm, which was originally designed to extract word correspondence...
Aligning Turkish and English Parallel Texts for Statistical Machine Translation
This paper presents a preliminary work on aligning Turkish and English parallel texts towards developing a statistical machine translation system for English and Turkish. To avoid the data sparseness problem and to uncover relations between sublexical components of words such as morphemes, we have converted our parallel texts to a morphemic representation and then used standard word alignment a...